Efficient spatial data partitioning for distributed $$k$$NN joins

Abstract

Parallel processing of large spatial datasets over distributed systems has become a core part of modern data analytic systems like Apache Hadoop and Apache Spark. The general-purpose design of these systems does not natively account for the data's spatial attributes, which results in poor scalability, accuracy, or prolonged runtimes. Spatial extensions remedy the problem and introduce spatial data recognition and operations. At the core of a spatial extension, a locality-preserving spatial partitioner determines how to spatially group the dataset's objects into smaller chunks using the distributed system's available resources. Existing spatial extensions rely on data sampling and often mismanage non-spatial data by either overlooking their memory requirements or excluding them entirely. This work discusses the various challenges that face spatial data partitioning and proposes a novel spatial partitioner for effectively processing spatial queries over large spatial datasets. For evaluation, the proposed partitioner is integrated with the well-known k-Nearest Neighbor ($$k$$NN) spatial join query. Several experiments evaluate the proposal using real-world datasets. Our approach differs from existing proposals by (1) accounting for the dataset's unique spatial traits without sampling, (2) considering the computational overhead required to handle non-spatial data, (3) minimizing partition shuffles, (4) computing the optimal utilization of the available resources, and (5) achieving accurate results. This work contributes to the problem of spatial data partitioning through (1) providing a comprehensive discussion of the problems facing spatial data partitioning and processing, (2) the development of a novel spatial partitioning technique for in-memory distributed processing, (3) an effective, built-in, load-balancing methodology that reduces spatial query skews, and (4) a Spark-based implementation of the proposed work with an accurate $$k$$NN spatial join query. Experimental tests show up to 1.48 times improvement in runtime as well as in the accuracy of results.
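For readers unfamiliar with the query used in the evaluation, the following is a minimal single-machine sketch of a kNN join: for every point in one dataset, it returns the k closest points in another. It is a brute-force illustration only and is not the paper's partitioner or its Spark-based implementation; the function name `knn_join` and the sample coordinates are assumptions made for this example.

```python
import heapq
import math


def knn_join(left, right, k):
    """Brute-force kNN join: map each point in `left` to its k nearest points in `right`.

    Points are coordinate tuples; distance is Euclidean.
    Cost is O(|left| * |right|), which is what distributed partitioning aims to reduce.
    """
    result = {}
    for p in left:
        # nsmallest keeps the k candidates with the smallest distance to p
        result[p] = heapq.nsmallest(k, right, key=lambda q: math.dist(p, q))
    return result


# Hypothetical sample data (not from the paper)
left = [(0.0, 0.0), (5.0, 5.0)]
right = [(1.0, 0.0), (0.0, 2.0), (4.0, 5.0), (9.0, 9.0)]
neighbors = knn_join(left, right, k=2)
```

Because every left point is compared with every right point, this approach does not scale. A locality-preserving partitioner of the kind the abstract describes would group nearby points together so that most comparisons stay within one partition.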

Similar resources

AdaptDB: Adaptive Partitioning for Distributed Joins

Big data analytics often involves complex join queries over two or more tables. Such join processing is expensive in a distributed setting both because large amounts of data must be read from disk, and because of data shuffling across the network. Many techniques based on data partitioning have been proposed to reduce the amount of data that must be accessed, often focusing on finding the best ...

BlockJoin: Efficient Matrix Partitioning Through Joins

Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-to-end ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user-defined functions used for feature extraction and vectorization, and (iii)...

Efficient Processing of Distributed Iceberg Semi-joins

The Iceberg SemiJoin (ISJ) of two datasets R and S returns the tuples in R which join with at least k tuples of S. The ISJ operator is essential in many practical applications including OLAP, Data Mining and Information Retrieval. In this paper we consider the distributed evaluation of Iceberg SemiJoins, where R and S reside on remote servers. We developed an efficient algorithm which employs B...

Efficient Top-k Spatial Distance Joins

Consider two sets of spatial objects R and S, where each object is assigned a score (e.g., ranking). Given a spatial distance threshold and an integer k, the top-k spatial distance join (k-SDJ) returns the k pairs of objects that have the highest combined score (based on an aggregate function γ) among all object pairs in R×S whose spatial distance is at most that threshold. Despite the practical applicat...

Optimal Partitioning for Spatial Data

It is desirable to design partitioning techniques that minimize the I/O time incurred during query execution in spatial databases. In this paper, we explore optimal partitioning techniques for spatial data for different types of queries. In particular, we show that hexagonal partitioning has optimal I/O cost for circular queries compared to all possible non-overlapping partitioning techniques th...

Journal

Journal title: Journal of Big Data

Year: 2022

ISSN: 2196-1115

DOI: https://doi.org/10.1186/s40537-022-00587-2